Abstract

In a matching paper [16], I proved that Conservation of Size and 
Information in a discrete token based system is overwhelmingly likely 
to lead to a power-law component size distribution with respect to 
the size of its unique alphabet. This was substantiated to a very 
high level of significance using some 55 million lines of source code of 
mixed provenance. The principle was also applied to show that average 
gene length should be constant in an animal kingdom where the same 
constraints appear to hold, the implication being that Conservation 
of Information plays a similar role in discrete token-based systems as 
the Conservation of Energy does in physical systems. 

In this part 2, the role of defect will be explored and a functional
behaviour for defect derived to be consistent with the power-law
behaviour substantiated above.

This will be supported by further experimental data and the
implications explored.
1 Preliminaries



1.1 Conservation of Information 

In [16], I showed that, given a discrete token based system of M components
where the i-th component contains $t_i$ tokens, i = 1,..,M, conservation of a
general quantity U and of total size T is overwhelmingly likely to lead to a
size distribution which obeys

$$ p_i = \frac{e^{-\beta \varepsilon_i}}{\sum_{j=1}^{M} e^{-\beta \varepsilon_j}} \qquad (1) $$

where $\beta$ is a constant set by the constraints, the total size is given by

$$ T = \sum_{i=1}^{M} t_i \qquad (2) $$

and there is some externally imposed entity $\varepsilon_i$ associated with each
token of component i whose total amount is given by

$$ U = \sum_{i=1}^{M} t_i \varepsilon_i \qquad (3) $$

I then showed that by identifying U with the total Hartley-Shannon
information content [10], [23], [21], [5], the resulting predicted distribution
takes on a power-law form given by

$$ p_i = \frac{a_i^{-\beta}}{Q(\beta)} \qquad (4) $$

where $p_i$ is the probability of a component of size $t_i$ tokens occurring and
$a_i$ is the size of the unique alphabet of tokens used to construct it. Here

$$ Q(\beta) = \sum_{i=1}^{M} a_i^{-\beta} \qquad (5) $$

and the result is now subject to the twin constraints that the total number
of tokens T is fixed

$$ T = \sum_{i=1}^{M} t_i \qquad (6) $$

and the total Hartley-Shannon information content I is also fixed

$$ I = \sum_{i=1}^{M} I_i \qquad (7) $$

where $I_i$ is the information content of the i-th component given by

$$ I_i = t_i \log a_i \qquad (8) $$

and log denotes the natural logarithm.

This result was substantiated against 55.5 million lines of source code in 6
languages and demonstrated valid to a p level < $2.2 \times 10^{-16}$. In other
words, it is extremely unlikely that this result would have occurred by chance
(p levels < 0.01 are considered emphatic).
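
As a small numerical illustration of the power-law probability (4), its normalising sum (5) and the information content (8), the following Python sketch evaluates these quantities for a handful of components. The token counts, alphabet sizes and the value of beta are invented for illustration, and the function names are hypothetical. Nothing here reproduces the analysis of [16].

```python
import math


def partition_function(alphabets, beta):
    """Q(beta) of equation (5): sum of a_i ** -beta over all components."""
    return sum(a ** -beta for a in alphabets)


def power_law_probabilities(alphabets, beta):
    """p_i of equation (4) for each component's unique alphabet size a_i."""
    q = partition_function(alphabets, beta)
    return [a ** -beta / q for a in alphabets]


def information_content(t, a):
    """I_i of equation (8): t_i * log(a_i), using the natural logarithm."""
    return t * math.log(a)


if __name__ == "__main__":
    # Hypothetical components: (tokens t_i, unique alphabet size a_i).
    components = [(120, 20), (450, 45), (2000, 110)]
    beta = 1.5  # illustrative value only; in practice it would be fitted
    alphabets = [a for _, a in components]
    for (t, a), p in zip(components, power_law_probabilities(alphabets, beta)):
        print(f"t={t:5d} a={a:4d} p={p:.4f} I={information_content(t, a):.1f}")
```

By construction the probabilities sum to one, and components built from larger unique alphabets receive smaller probabilities for any positive beta.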



1.2 Conservation of defect 

A defect in its simplest terms is a mistake. In software systems, a defect
is some kind of mistake in the coding (and there are many kinds [3]),
which causes the run-time behaviour of a program to depart from its expected
behaviour. In a biological system, it might be a copying error in a gene. In
both cases, we can imagine that there must be a total number of defects D
given by

$$ D = \sum_{i=1}^{M} d_i \qquad (9) $$

where $d_i$ is the number of defects in the i-th software component or gene.
Following a similar development to [13], this can be written as

$$ D = \sum_{i=1}^{M} t_i \left( \frac{d_i}{t_i} \right) \qquad (10) $$

If we now identify $\varepsilon_i$ of equation (3) as follows,

$$ \varepsilon_i = \frac{d_i}{t_i} \qquad (11) $$

(in other words, each token of the i-th component has a defect density
associated with it given by $d_i / t_i$), then the corresponding most likely
distribution which maintains constant total defects and size is given from (1)
and (11) by:-

$$ p_i = \frac{e^{-\beta d_i / t_i}}{\sum_{j=1}^{M} e^{-\beta d_j / t_j}} \qquad (12) $$

However, we know from the measurements described by [16] that real software
systems obey (4) to a very high level of certainty, and so equating the
distributions (4) and (12) suggests that a defect conserving system will obey
the following:-

$$ d_i \sim t_i \log a_i \qquad (13) $$

Note that this is an identical relationship to the information content of
the i-th component (8) because they are both conserved during variation.
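
The step from the size distribution (4) and the defect distribution (12) to the relation (13) can be spelled out as follows. This is only a sketch under stated assumptions: the multiplier of the defect-conserving distribution is written $\beta'$ because it need not equal the $\beta$ of (4), and the constants $\kappa$ and $c$ are introduced here purely for illustration.

```latex
\begin{align*}
p_i &= \frac{a_i^{-\beta}}{Q(\beta)} = \frac{e^{-\beta \log a_i}}{Q(\beta)}
  && \text{power-law form (4)} \\
p_i &= \frac{e^{-\beta' d_i / t_i}}{\sum_{j=1}^{M} e^{-\beta' d_j / t_j}}
  && \text{defect-conserving form (12)} \\
\beta' \frac{d_i}{t_i} &= \beta \log a_i + c
  && \text{the two forms agree for every } i \\
d_i &= \kappa \, t_i \log a_i + \frac{c}{\beta'} \, t_i ,
  \qquad \kappa = \frac{\beta}{\beta'}
\end{align*}
```

With $c = 0$ this reduces to the proportionality $d_i \sim t_i \log a_i$ of (13).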

1.3 Some existing component defect models 

It is interesting at this point to pause for a moment and consider empirically 
observed distributions of defects in components of real software systems. The 
first thing to appreciate is that lines of code are inevitably used as a measure 
of program size in such studies. The reason for this is that lines of code 
are much easier to measure than the tokens used above, which require the 
development of compiler front-ends to measure properly [16]. The downside
however is that it is not a very precise measure in that lines of code can be 
defined in a number of ways, for example as a count of the newline characters 
as is most common, but it might also be a count of only those lines of code 
which cause a compiler to generate object code, (in which case they are 
known as executable lines of code). In addition, lines are layout based and 
therefore subject to stylistic interpretation whereas tokens are unambiguous. 
They are of course closely related but one programmer might typically use a 
smaller number of tokens per line than another as a matter of personal style. 

The ease of use of lines of code as a measure has meant that virtually all
of the research into empirical distributions of defect uses lines of code as the
independent variable, leading to a relationship $d_i = d_i(n_i)$, where $n_i$ is
the number of lines of code in the i-th component.
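
The difference between the two size measures can be illustrated with a short Python sketch. It uses Python's standard `tokenize` module as a stand-in for the compiler front-ends mentioned above, which is an assumption of this example and not the tooling used in [16]. The choice of which token types to ignore is also an assumption.

```python
import io
import tokenize

# Token types that carry layout rather than content; skipping them is a
# choice made for this sketch, not a rule taken from the paper.
_LAYOUT = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT, tokenize.DEDENT,
    tokenize.COMMENT, tokenize.ENDMARKER, tokenize.ENCODING,
}


def size_measures(source):
    """Return (lines, t, a) for a piece of Python source text.

    lines: count of newline characters
    t:     total number of tokens
    a:     size of the unique alphabet of tokens
    """
    tokens = [
        tok.string
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type not in _LAYOUT
    ]
    lines = source.count("\n")
    return lines, len(tokens), len(set(tokens))


if __name__ == "__main__":
    # Two hypothetical layouts of similar logic.
    one_line = "y = x + 1 if x else 0\n"
    many_lines = "if x:\n    y = x + 1\nelse:\n    y = 0\n"
    print(size_measures(one_line))
    print(size_measures(many_lines))
```

For a given source text the token counts are fixed by the language definition, whereas the line count depends on how the programmer chose to lay the code out.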

There have been numerous attempts at modelling such defect behaviour
as a function of component size, for example, [1], [11], [7], [5] and [12]. In
the absence of any models of defect growth, these are essentially exercises
in data-fitting and all show at least linear growth in the number of defects
with component size. In particular, [13] and [12] both report logarithmic
behaviour, and notably in the case of [19], $d_i \sim n_i \log n_i$.



In this section, I have shown that subject to the constraints of constant
total defects (9) and constant total size, a component size model strongly
substantiated in [16] leads directly to a prediction that the defect
distribution for a software system in equilibrium will obey the relation (13).

This is a direct consequence of the principle of Conservation of Information
and this will now be tested on two very disparate real systems which have
been in use for some considerable time and should therefore exhibit at least
quasi-equilibrated behaviour. Note that in this sense, equilibration refers
to the process of continual use gradually flushing out residual defects so that
as the number of discovered defects approaches D (noting that we only know
D is fixed, we do not know its value), the program becomes increasingly
reliable, or mature as it is commonly known.

2 Application to software systems 

2.1 Experimental verification 

Validating the relationship (4), although requiring the development of
lexical analysers capable of extracting the required tokens [16], is
unambiguous. Such tokens are part of the definition of a programming language
and when counted by separate experiments should always yield the same results,
otherwise the language would have unacceptable ambiguity.

The situation is not so simple for the measurement of defect. Such
measurement almost always involves a measure of subjectivity, in the
identification of the defect or even whether it is considered to be a defect
at all. Further complications intrude such as the counting of two code
fragments in separate locations which together produce a defect. Is this one
occurrence or two? Such questions have never been and probably never will be
resolved unambiguously, so it should immediately be recognised that defect
measurements are noisy. Token measurements are not (unless the tokeniser
itself is in error).

2.2 Results 

With these comments in mind, two packages were initially selected to test
the relationship in (13) because both have an extensive and well-maintained
defect history which can be mined by suitable tools.
Figure 1: The distribution of defects (y-axis) against t log(a) (x-axis) for the
NAG library. The left hand graph shows all the data up to and including
7 defects. The right hand graph shows those components with 0, 1 and 2
defects (more than 95% of the components). It should be re-iterated that
each point in the right hand graph is the mean of many values of t log(a)
which have the same number of defects.



2.2.1 NAG scientific subroutine library (Fortran) 

The NAG Fortran scientific subroutine library was extensively analysed by
[15]. It has a detailed defect record embedded in its source code which the
authors mined and associated with each component so that it can be merged
with measurements of $t_i$, $a_i$ made on the same code. For each defect count
up to a maximum of 7 per component (very few components had more than this
and were therefore excluded), the value of t log(a) was averaged and the
resulting data are presented as Figure 1.

The predicted linearity of Figure 1 was subjected to a standard test for
significance using the linear modelling function lm() in the widely-used R
statistical package (see footnote 1). This reported a high degree of
linearity, with an adjusted R-squared of 0.89 and an associated p-value (the
probability of a fit at least this strong arising by chance if there were no
linear relationship) of 0.0002544, an emphatic result. The corresponding
output from R is shown below. (Note that in the R analysis, the t log(a)
values were normalised by a factor of 5000.0.)

    lm(formula = y ~ x, data = universe)

    Residuals:
        Min      1Q  Median      3Q     Max
    -0.7120 -0.4648 -0.3056  0.2195  1.4967

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  -0.6021     0.6048  -0.995 0.357931
    x             2.2439     0.2921   7.683 0.000254 ***

    Residual standard error: 0.8036 on 6 degrees of freedom
    Multiple R-squared: 0.9077, Adjusted R-squared: 0.8924
    F-statistic: 59.03 on 1 and 6 DF, p-value: 0.0002544

1 http://www.r-project.org/
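
The analysis described above (group components by defect count, average t log(a) within each group, then fit a straight line) can be sketched in Python as follows. This is not the author's R script: the function names and the sample data are hypothetical, and the fit is a plain ordinary least-squares calculation standing in for R's lm().

```python
import math
from collections import defaultdict


def mean_tloga_by_defects(components, scale=5000.0, max_defects=7):
    """Average t*log(a) over components sharing a defect count.

    components: iterable of (t, a, d) = (tokens, alphabet size, defects).
    Returns parallel lists xs (scaled mean t*log(a)) and ys (defect count).
    """
    groups = defaultdict(list)
    for t, a, d in components:
        if d <= max_defects:
            groups[d].append(t * math.log(a) / scale)
    ys = sorted(groups)
    xs = [sum(groups[d]) / len(groups[d]) for d in ys]
    return xs, ys


def adjusted_r2(r2, n, predictors=1):
    """Adjusted R-squared for n points and the given number of predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - predictors - 1)


def linear_fit(xs, ys):
    """Least squares of y on x; returns slope, intercept, R^2, adjusted R^2."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    return slope, intercept, r2, adjusted_r2(r2, n)


if __name__ == "__main__":
    # Invented components (t, a, d); not data from the NAG library.
    sample = [(900, 30, 0), (1500, 40, 0), (4000, 60, 1),
              (5200, 70, 1), (9000, 90, 2), (14000, 120, 3)]
    xs, ys = mean_tloga_by_defects(sample)
    print(linear_fit(xs, ys))
```

Applying `adjusted_r2` to the Multiple R-squared of 0.9077 reported above, with the 8 points implied by 6 residual degrees of freedom, gives approximately 0.892, in line with the adjusted value in the R output.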



2.2.2 Eclipse IDE (Java) 

The Eclipse IDE written in Java is another example of a well-instrumented
software package. In this case, the hard work of extracting defect records and
associating them with particular components has already been done by [22]
(see footnote 2). All that was necessary here was to extract all the $t_i$,
$a_i$ using the methods described above and in [16] and to plot the data for
all components with up to 12 defects (again very few components contained
more than this and were consequently excluded). Again the value of t log(a)
was normalised by a convenient factor of 5000.0 before analysis with R. The
results this time were:-



    lm(formula = y ~ x, data = universe)

    Residuals:
        Min      1Q  Median      3Q     Max
    -1.9848 -0.6129 -0.2032  0.6618  1.7910

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   0.3256     0.5874   0.554     0.59
    x             1.5324     0.1340  11.435 1.91e-07 ***

    Residual standard error: 1.133 on 11 degrees of freedom
    Multiple R-squared: 0.9224, Adjusted R-squared: 0.9153
    F-statistic: 130.8 on 1 and 11 DF, p-value: 1.907e-07

2 See also http://www.st.cs.uni-sb.de/softevol The data comes from releases
2.0, 2.1 and 3.0. There are 10,613 components in release 3.0.
Figure 2: The distribution of defects (y-axis) against t log(a) (x-axis) for
the Eclipse IDE. The left hand graph ("Defects with tloga") shows all the
data up to and including 12 defects. The right hand graph ("Defects with
tloga (95% data)") shows those components with 0-5 defects (more than 95% of
the components). Again each point in the right hand graph is the mean of many
values of t log(a) which have the same number of defects.



giving an adjusted R-squared of 0.92. This represents an even higher quality
linear fit due to the larger quantity of data.



To summarise these two experiments, even though defect data are inherently
much noisier than token measurements, the degree of linearity predicted
by (13) is well supported.



2.3 Equilibration

This result may contribute to answering a difficult question in software
engineering - "How can you tell when a software component has been thoroughly
tested?" This attempts to place into words the perceived property of a
system which, on continuous running in diverse environments, fails very rarely
in some sense. The problem is that when a product is shipped for the first
time, a low early defect measure says nothing about the future behaviour
unless it is linked with a substantial run-time history. To put this into the
form of a simple aphorism:

There are two ways of achieving low defect: the first is to have 
a very good system, and the second is to have very poor testing. 

We obviously prefer the former. However, the development which led up
to (13) concerns a system's equilibration, as shown with a number of systems
in [16]. In other words, departures from (13) may tell us how well the system
has been tested. It turns out that in the case of the NAG library, 95% of the
components have exhibited 0, 1 or 2 defects. The same 95% cut-off when
applied to the Eclipse data covers 0-5 defects. Inspecting Figures 1 and 2
for these values shows that they are very highly linear here, with significant
departures only appearing for a small population of components. If this 5%
of components is excluded, the adjusted R-squared values reach 0.99 in
both cases.

I therefore propose that the adjusted R-squared value for linear
fit could be used to determine how well code has been tested
simply from its defect records. If there is no substantial evidence
of linearity up to a number of defects corresponding to say, 95%
of the defect data, then the defect data has not yet equilibrated
and it is likely that there are more defects to be found. This is
a form of reliability growth modelling in which the temporal axis
of reliability growth is replaced by departures from linearity of an
asymptotic defect distribution shaped by the Conservation of
Information.

Software defect data are not easy to work with, as has already been
discussed, but it is hoped that this will inspire further experiments to test
the asymptotic result of (13).
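
The proposal above can be sketched as a small procedure: find the defect count that covers 95% of the components, fit a straight line to the averaged points up to that count, and compare the adjusted R-squared with a threshold. The Python below is a hedged illustration only. The function names are hypothetical and the threshold of 0.95 is an arbitrary choice made for this sketch, not a value proposed in the paper.

```python
from collections import Counter


def defect_cutoff(defect_counts, coverage=0.95):
    """Smallest defect count d such that at least `coverage` of the
    components have d or fewer defects."""
    histogram = Counter(defect_counts)
    total = sum(histogram.values())
    running = 0
    for d in sorted(histogram):
        running += histogram[d]
        if running >= coverage * total:
            return d
    raise ValueError("no components supplied")


def adjusted_r2_of_line(xs, ys):
    """Adjusted R-squared of an ordinary least-squares straight-line fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    ss_res = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - (ss_res / ss_tot) * (n - 1) / (n - 2)


def looks_equilibrated(points, defect_counts, threshold=0.95, coverage=0.95):
    """points: list of (mean t*log(a), defects), one entry per defect count.
    defect_counts: the defect count of every individual component.

    `threshold` is an illustrative assumption, not a value from the paper.
    """
    cutoff = defect_cutoff(defect_counts, coverage)
    kept = [(x, d) for x, d in points if d <= cutoff]
    if len(kept) < 3:
        raise ValueError("need at least three points for an adjusted fit")
    xs, ys = zip(*kept)
    return adjusted_r2_of_line(xs, ys) >= threshold
```

At least three averaged points are required because the adjusted R-squared of a straight-line fit is undefined for two points.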

3 Application to genetic systems 

I will now apply the general principle expressed by (13) to predicting defect
properties of genes. No experimental evidence will be presented for this here
as this is the subject of a companion paper, [17].

In [16], I demonstrated that the principle of Conservation of Information
predicts that gene length is uniformly distributed, a direct result of the fixed
4-base ACGT alphabet of the genome. In turn this implies that the ratio
of total sequence length / number of genes is constant. This prediction is
well supported by the experiments of [22] and for continuity, I will repeat
some of their comments here. They surveyed almost all prokaryotic and
eukaryotic species whose complete genome sequence data were then available
and well annotated. These data included 81 prokaryotes. They regressed the
estimate of total coding sequence length against the estimate of the number
of genes for each of the two groups of species. They found that although
the average lengths of genes in prokaryotes and eukaryotes are significantly
different, the average lengths of genes are highly conserved within either of the
two kingdoms. They concluded that natural selection has clearly set a strong
limitation on gene elongation within the kingdom and that the average gene
size adds another distinct characteristic for the discrimination between the
two kingdoms of organisms. Their data is reproduced by kind permission as
Figure 3.

Figure 3: Linear regression analysis of the total sequencing length against
the number of genes shown by [25]. (x-axis of both panels: total number of
genes.)

3.1 The growth of defects on genes 

As noted in [16], genes have length $t_i$ bases chosen from a unique alphabet
of $a_i$ bases; however, the alphabet of bases in the genetic code is fixed to
adenine, cytosine, guanine and thymine. In other words, $a_i = 4, \forall i$.
Using (13) for the genome then gives

$$ d_i \propto t_i \qquad (14) $$

In other words, Conservation of defect in a system with uniform probability
distribution for gene length implies that by far the most likely outcome
is that genetic defects are linearly proportional to gene length.

This is considered in much more detail, including the effects of kingdoms,
in a companion paper [17].
4 Comparisons with Halstead



This is not the first time in which defects have been related to tokens of
programming languages. Halstead [8], [9] made an intensive study of the
relationship between Shannon information theory and programming languages,
defining a number of heuristics: program volume, effort, information content
and so on. The current work is based on [13], [14], and takes a different
approach combining Shannon information theory with concepts of statistical
mechanics. This avoids emphasising the meaning of tokens and simply refers
to the choices which can be made as described by [5].

5 On defect growth generally 

Defect growth in systems has been widely studied for a number of years using
a variety of reliability growth models, [1], [21], [2], [20]. In spite of this, it
is still relatively rare to find good defect data which can be analysed for the
verification of models such as that proposed here. The Open Source
movement has improved this, along with tools such as Bugzilla (see footnote
3), but the situation is still not as good as for the analysis of token
distributions in open source in part 1 of this paper, [16]. In particular,
equilibration to the predicted distribution (15) is not well covered as it
requires meticulous defect records from the early days of a large system and
these need to be associated with particular components, as done in an
exemplary fashion by [22] with Eclipse.

6 On equilibration and tokens

Perhaps the most difficult idea to grasp in using variational principles like
this is that such principles are ergodic. They are not talking about a single
system but about all possible systems. In other words, when total size is
constrained to say T tokens, this does not mean that the results are only
relevant to systems with this size. Instead, all the variational method says
is that if the totality of all possible systems of size T are considered, then
an overwhelmingly large number of them will produce a component size
distribution obeying (4).

If I select a particular system and change its size in some way to T', as
occurs both in software development, through incremental change, and
in genetic development, through the usual mechanisms of natural selection
and mutation, then it simply becomes one of the totality of systems of size
T'. The variational method knows nothing of this and indeed doesn't care.
I could, because I have free choice, develop a software system of size T with
M components in the programming language C, all but one of which contain
the same static function definition, along with an empty main() component.
It will compile, link, run and be exceptionally uninteresting in every way
except that it will not obey a power-law in its component distribution. It is
however, just one of the totality of programs of size T, which overwhelmingly
will obey such a power-law.

3 http://www.bugzilla.org/

I could embark on a crusade and try to persuade every programmer on 
earth to write the same program for the rest of their lives and to pass this 
on to their descendants in order to break the power-law distribution but it is 
not very likely and in any case, ergodically speaking, it does not cover every 
system of T tokens. Finally, in a perfect gas, which is where such variational 
methods were honed, it is perfectly possible that all the molecules in a room 
will suddenly find themselves under a table so it shoots into the air, but it 
is not very likely. 

It is also worth saying something about tokens in general and unique
alphabets of tokens in particular. When deciding on a unique alphabet in a
programming language, it is easy to find token combinations which are
dependent on each other. For example, again from the programming language
C, the token "if" must be followed by the token "(". Anything else is
syntactically illegal in C. Does this then mean that these are one token or
two? The answer to this conundrum lies with information theory. As I discussed
at some length in [16], the meaning of the tokens is irrelevant in this context.
Information content is only about choice, not meaning, so there are indeed
two tokens.

7 Conclusions 

The paper presents several contributions. 

• Using variational principles suggested in [16], [13] and using the
  principle of the Conservation of Information, it is predicted that the number
  of defects in any component of size $t_i$ tokens constructed from a unique
  alphabet of $a_i$ tokens will equilibrate to a distribution given by

  $$ d_i \sim t_i \log(a_i) \qquad (15) $$

  Substantial evidence in favour of this was presented using a large
  Fortran system (the NAG library) and a Java system (the Eclipse IDE).
  Note that this distribution has clustering properties. Observation of
  defect clustering is described in [15].

• It is proposed that departures from (15) could be used to measure the
  degree of equilibration which has taken place, specifically via the adjusted
  R-squared value of a linear fit.

• The underlying principle of Conservation of Information and constant
  total number of defects lead to a prediction that the number of defects
  in a gene is linearly proportional to its length. This raises a number of
  issues and is being considered separately in a companion paper [17].